Handling Categorical Data with Pandas

pandas

dataframe

categorical-data

data-transformation

Comprehensive guide to handling categorical data in Pandas, including encoding techniques, grouping operations, and data reshaping methods like melt and pivot.

Author

Mohammed Adil Siraju

Published

September 21, 2025

This notebook covers essential techniques for working with categorical data in Pandas, including:

- Encoding Methods: Converting categorical variables to numerical formats
- Grouping Operations: Analyzing category distributions and aggregations
- Data Transformation: Reshaping data with melt and pivot operations

Categorical data transformation is crucial for machine learning models that require numerical inputs.

1. Setting Up Sample Data

Let’s start by creating a sample DataFrame with categorical data to work with.

import pandas as pd

data = {
    'Category': ['A','B','C','C','B','A']
}

df = pd.DataFrame(data)

df

	Category
0	A
1	B
2	C
3	C
4	B
5	A

2. Encoding Categorical Data

Machine learning algorithms typically require numerical inputs. Categorical encoding converts text categories into numbers. Here are the most common techniques:

One-Hot Encoding

One-hot encoding creates binary columns for each category. It’s ideal for nominal (unordered) categories.

Pros: No ordinal assumptions, works well with most algorithms

Cons: Can create many columns (curse of dimensionality)

# Keep only the A and B indicator columns; a row where both are False is category C
pd.get_dummies(df['Category'])[['A','B']]

	A	B
0	True	False
1	False	True
2	False	False
3	False	False
4	False	True
5	True	False
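
As a minimal sketch of two common variations, the snippet below rebuilds the sample DataFrame and calls pd.get_dummies with the prefix, dtype, and drop_first parameters. The variable names full and reduced are illustrative, not part of the original notebook.

```python
import pandas as pd

df = pd.DataFrame({'Category': ['A', 'B', 'C', 'C', 'B', 'A']})

# One indicator column per category, stored as 0/1 integers instead of booleans
full = pd.get_dummies(df['Category'], prefix='Category', dtype=int)

# drop_first=True omits the first category (A); a row of all zeros then means A
reduced = pd.get_dummies(df['Category'], prefix='Category',
                         drop_first=True, dtype=int)

print(full.columns.tolist())     # ['Category_A', 'Category_B', 'Category_C']
print(reduced.columns.tolist())  # ['Category_B', 'Category_C']
```

Dropping one column avoids perfectly redundant indicators, which can matter for linear models; tree-based models usually work fine with the full set.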

Label Encoding

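Label encoding assigns an integer to each category. It suits categories with a natural order (ordinal data), but note that scikit-learn's LabelEncoder numbers the labels in sorted order, so the codes reflect the natural order only when it happens to match alphabetical order.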

Pros: Memory efficient, preserves single column

Cons: Implies ordinal relationship even when none exists

from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()

df['Category_LabelEncoded'] = label_encoder.fit_transform(df['Category'])

df

	Category	Category_LabelEncoded
0	A	0
1	B	1
2	C	2
3	C	2
4	B	1
5	A	0
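
When the natural order of the categories differs from their alphabetical order, the order can be stated explicitly. The sketch below uses pandas' own CategoricalDtype rather than LabelEncoder; the Size column and its low/medium/high values are hypothetical and not part of the original data.

```python
import pandas as pd

# Hypothetical ordinal column; the values and their order are assumptions
sizes = pd.DataFrame({'Size': ['medium', 'low', 'high', 'low']})

# Default category order is sorted: high=0, low=1, medium=2
alphabetical = sizes['Size'].astype('category').cat.codes

# Explicit order: low < medium < high
ordered_type = pd.CategoricalDtype(categories=['low', 'medium', 'high'],
                                   ordered=True)
sizes['Size_Encoded'] = sizes['Size'].astype(ordered_type).cat.codes

print(alphabetical.tolist())           # [2, 1, 0, 1]
print(sizes['Size_Encoded'].tolist())  # [1, 0, 2, 0]
```

With the explicit dtype, the integer codes increase with the intended ranking, which is what an ordinal feature needs.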

# Recreate the original DataFrame, discarding the label-encoded column

data = {
    'Category': ['A','B','C','C','B','A']
}

df = pd.DataFrame(data)

df

	Category
0	A
1	B
2	C
3	C
4	B
5	A

3. Analyzing Categorical Data with Grouping

Grouping operations help you understand the distribution and patterns in your categorical data. This is essential for exploratory data analysis.

Counting Category Frequencies

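Use groupby().size() or groupby().count() to see how many times each category appears. size() counts every row in a group, while count() counts only the non-null values in each column, so the two differ when data is missing.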

df.groupby('Category').size()

Category
A    2
B    2
C    2
dtype: int64

df.groupby('Category').agg({'Category':'count'})

	Category
Category
A	2
B	2
C	2
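
The difference between size() and count() only shows up with missing values. The sketch below adds a hypothetical numeric Value column (invented for illustration, with one missing entry) to the sample categories and also aggregates it per category.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'Value' is an invented column with one missing entry
df_vals = pd.DataFrame({
    'Category': ['A', 'B', 'C', 'C', 'B', 'A'],
    'Value': [10.0, 20.0, np.nan, 30.0, 60.0, 50.0],
})

grouped = df_vals.groupby('Category')

sizes = grouped.size()             # rows per group, missing values included
counts = grouped['Value'].count()  # non-null 'Value' entries per group
means = grouped['Value'].mean()    # mean skips missing values by default

print(sizes.tolist())   # [2, 2, 2]
print(counts.tolist())  # [2, 2, 1]
print(means.tolist())   # [30.0, 40.0, 30.0]
```

Category C has two rows but only one usable Value, which size() alone would not reveal.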

4. Data Transformation: Reshaping with Melt and Pivot

Data reshaping is crucial for transforming your data between “wide” and “long” formats. This is particularly useful when working with categorical data across multiple variables.

Wide to Long Format (melt)

pd.melt() unpivots a DataFrame from wide format to long format. This is useful for:

- Converting multiple categorical columns into a single column
- Preparing data for visualization libraries
- Making data more database-friendly

# Reshaping Data
data = {
    'Name': ['John', 'Emily', 'Kate'],
    'Math': [90, 85, 88],
    'Science': [92, 80, 95]
}

df = pd.DataFrame(data)
df

	Name	Math	Science
0	John	90	92
1	Emily	85	80
2	Kate	88	95

df_melted = pd.melt(df, id_vars='Name', var_name='Subject', value_name='Score')
df_melted

	Name	Subject	Score
0	John	Math	90
1	Emily	Math	85
2	Kate	Math	88
3	John	Science	92
4	Emily	Science	80
5	Kate	Science	95
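
By default melt unpivots every column not listed in id_vars. As a small sketch, the value_vars parameter restricts this to chosen columns; the variable name math_only is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emily', 'Kate'],
    'Math': [90, 85, 88],
    'Science': [92, 80, 95],
})

# value_vars limits which columns are unpivoted; columns listed in neither
# id_vars nor value_vars (here 'Science') are left out of the result
math_only = pd.melt(df, id_vars='Name', value_vars=['Math'],
                    var_name='Subject', value_name='Score')

print(math_only.shape)               # (3, 3)
print(math_only['Score'].tolist())   # [90, 85, 88]
```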

Long to Wide Format (pivot)

df.pivot() does the opposite of melt: it converts long format back to wide format. This is useful for:

- Creating summary tables
- Preparing data for certain types of analysis
- Making data more human-readable

df_melted.pivot(index='Name', columns='Subject', values='Score')

Subject	Math	Science
Name
Emily	85	80
John	90	92
Kate	88	95
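
The pivoted result keeps Name as the index and carries a Subject label on the columns. One way to get back to a flat table like the original is sketched below; the variable name wide is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emily', 'Kate'],
    'Math': [90, 85, 88],
    'Science': [92, 80, 95],
})
df_melted = pd.melt(df, id_vars='Name', var_name='Subject', value_name='Score')

# Pivot back to wide format, then turn the 'Name' index into a regular column
wide = df_melted.pivot(index='Name', columns='Subject', values='Score').reset_index()

# Remove the leftover 'Subject' label from the column index
wide.columns.name = None

print(wide.columns.tolist())  # ['Name', 'Math', 'Science']
print(wide['Name'].tolist())  # ['Emily', 'John', 'Kate']
```

The rows come back sorted by Name rather than in their original order. Note also that pivot() raises a ValueError when an index/column pair appears more than once in the long data.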

Summary

In this notebook, you learned essential data transformation techniques for categorical data:

Encoding: Convert text categories to numbers
- One-hot encoding for nominal data
- Label encoding for ordinal data
Grouping: Analyze category distributions
- Count frequencies with groupby().size()
- Aggregate data by categories
Reshaping: Transform data structure
- melt(): Wide to long format
- pivot(): Long to wide format

These techniques form the foundation of data preprocessing for machine learning and analysis workflows. Choose the right method based on your data characteristics and modeling requirements!

Next Steps: Practice with real datasets and explore advanced encoding techniques like target encoding or frequency encoding.